
[SPARK-49459][CORE][SHUFFLE] Support CRC32C for Shuffle Checksum #47929

Closed
yaooqinn wants to merge 4 commits into apache:master from yaooqinn:crc32c


@yaooqinn
Member

@yaooqinn yaooqinn commented Aug 29, 2024

What changes were proposed in this pull request?

This PR adds CRC32C (java.util.zip.CRC32C) as a supported value for spark.shuffle.checksum.algorithm. The JDK has provided CRC32C since Java 9. The JDK source describes its implementation as follows:

/*
* This CRC-32C implementation uses the 'slicing-by-8' algorithm described
* in the paper "A Systematic Approach to Building High Performance
* Software-Based CRC Generators" by Michael E. Kounavis and Frank L. Berry,
* Intel Research and Development
*/
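
For readers unfamiliar with the JDK class, here is a minimal sketch of computing a CRC-32C through the common `java.util.zip.Checksum` interface. The class name `Crc32cDemo` and the helper `crc32c` are hypothetical names used only for illustration; they are not part of this PR or of Spark.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

// Hypothetical demo class; not part of Spark.
class Crc32cDemo {
    // Computes the CRC-32C of a byte array with the JDK implementation (Java 9+).
    static long crc32c(byte[] data) {
        Checksum checksum = new CRC32C();
        checksum.update(data, 0, data.length);
        return checksum.getValue();
    }

    public static void main(String[] args) {
        // "123456789" is the conventional check input for CRC algorithms.
        byte[] input = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.printf("CRC32C = 0x%08X%n", crc32c(input)); // prints "CRC32C = 0xE3069283"
    }
}
```

Because `CRC32C` implements the same `Checksum` interface as `CRC32` and `Adler32`, code written against that interface can switch algorithms without other changes.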

Why are the changes needed?

CRC32C can outperform CRC32 on CPUs that provide dedicated instructions for it, such as SSE4.2 on x86 and the CRC32 extension on ARMv8.

Does this PR introduce any user-facing change?

Yes, spark.shuffle.checksum.algorithm can be set to CRC32C.
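
As an illustration of what accepting a new value means, the sketch below maps an algorithm name to a JDK `Checksum` instance. It is an assumption-laden example, not Spark's actual implementation: the class `ChecksumSelector`, the method `forAlgorithm`, the exact-match (case-sensitive) lookup, and the error message are all hypothetical.

```java
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

// Hypothetical helper; Spark's own selection logic may differ.
class ChecksumSelector {
    // Returns a fresh Checksum for the given algorithm name.
    // Exact, case-sensitive matching is an assumption of this sketch.
    static Checksum forAlgorithm(String name) {
        switch (name) {
            case "ADLER32":
                return new Adler32();
            case "CRC32":
                return new CRC32();
            case "CRC32C":
                return new CRC32C();
            default:
                throw new UnsupportedOperationException(
                    "Unsupported checksum algorithm: " + name);
        }
    }

    public static void main(String[] args) {
        Checksum checksum = forAlgorithm("CRC32C");
        byte[] block = {1, 2, 3, 4};
        checksum.update(block, 0, block.length);
        System.out.println(checksum.getClass().getSimpleName() + " -> " + checksum.getValue());
    }
}
```

Returning a new instance per call keeps the sketch safe to use from multiple writers, since `Checksum` objects carry mutable state.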

How was this patch tested?

I tested this with a benchmark of the checksum algorithms:

================================================================================================
Benchmark Checksum Algorithms
================================================================================================

OpenJDK 64-Bit Server VM 17.0.12+0 on Mac OS X 14.6.1
Apple M2 Max
Checksum Algorithms:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
CRC32                                              4145           4190          46          0.0     4047834.9       1.0X
CRC32C                                             4115           4155          35          0.0     4018904.7       1.0X
Adler32                                            1961           1972          16          0.0     1914619.1       2.1X
PureJavaCrc32C                                    18115          18322         245          0.0    17690350.5       0.2X
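
A comparison like the one above can be roughly approximated with plain JDK code. The following is a simplified sketch, not the benchmark harness used for this PR: the class `ChecksumTiming`, the method `timeNanos`, the 1 MiB buffer, and the iteration counts are all assumptions, and a naive timing loop like this is sensitive to JIT warm-up and machine noise.

```java
import java.util.Random;
import java.util.function.Supplier;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

// Hypothetical timing sketch; not Spark's benchmark code.
class ChecksumTiming {
    // Checksums `data` `iterations` times and returns the elapsed nanoseconds.
    static long timeNanos(Supplier<Checksum> factory, byte[] data, int iterations) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            Checksum checksum = factory.get();
            checksum.update(data, 0, data.length);
            sink ^= checksum.getValue(); // consume the result so the loop is not optimized away
        }
        long elapsed = System.nanoTime() - start;
        if (sink == Long.MIN_VALUE) {
            System.out.print(""); // keeps `sink` observable; never true for 32-bit checksums
        }
        return elapsed;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 20]; // 1 MiB of pseudo-random input
        new Random(0).nextBytes(data);

        // Warm up each implementation before measuring.
        timeNanos(CRC32::new, data, 200);
        timeNanos(CRC32C::new, data, 200);
        timeNanos(Adler32::new, data, 200);

        System.out.println("CRC32   ns: " + timeNanos(CRC32::new, data, 1000));
        System.out.println("CRC32C  ns: " + timeNanos(CRC32C::new, data, 1000));
        System.out.println("Adler32 ns: " + timeNanos(Adler32::new, data, 1000));
    }
}
```

Results from such a loop vary by JDK version and CPU, so they should be read only as a rough indication.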

Was this patch authored or co-authored using generative AI tooling?

no

@yaooqinn
Member Author

cc @Ngone51 @dongjoon-hyun @cloud-fan thank you

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-49459][CORE][SHUFFLE] Support CRC32C for Shuffle Checksum [SPARK-49459][CORE][SHUFFLE] Support CRC32C for Shuffle Checksum Aug 30, 2024
Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you, @yaooqinn . This looks like a good addition for Apache Spark 4.0.0.

Merged to master.

@yaooqinn yaooqinn deleted the crc32c branch September 2, 2024 05:36
@yaooqinn
Member Author

yaooqinn commented Sep 2, 2024

Thank you @dongjoon-hyun

dongjoon-hyun added a commit that referenced this pull request Jan 21, 2025
### What changes were proposed in this pull request?

This PR aims to add `CRC32C` test cases.

### Why are the changes needed?

Apache Spark supports `CRC32C`. We should add test coverage for it, as we already have for `CRC32`.
- #47929

### Does this PR introduce _any_ user-facing change?

No. This is a test case addition.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49582 from dongjoon-hyun/SPARK-50902.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun added a commit that referenced this pull request Jan 21, 2025
(cherry picked from commit 98f2767)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 14, 2025
(cherry picked from commit 2233537)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>